The goal of this study is to create an open (CC0) fragment-based melting point model that can be deployed into Open Babel. This model will replace our previous model, which used a limited set of fragments, that has performance statistics of R2: 0.61, AAE: 43.6, and RMSE: 58.1 on the 3500 compound test-set.
Procedure
To make models easily comparable, we use here the same melting point dataset (ONSMP031.csv) as used in our previous models (MPModel007, MeltingPointModel011), keeping the exact same training-set and test-set as before too.
We converted the SMILES from the same training set (ONSMP031TrainingSet.csv) into an sdf file which we then ran through SMF v3.03 to calculate fragments, both atom/bond sequences (lengths: 2-5). This gave us a training-set of 16,015 compounds with 2568 sequence descriptors.
We removed strongly correlated descriptors by using the following code in R (uses the caret package)
## set the working directory
setwd("C:/MeltingPointModel013")
## load the caret package
library(caret)
## load in data
mydata = read.csv(file="TrainingSetWithSequenceFragments.csv",head=TRUE,row.names="molID")
## get the number of columns
ncol(mydata)
[output]
2569
[output]
## remove small variance columns
nzv <-nearZeroVar(mydata)
mydata <- mydata[, -nzv]
ncol(mydata)
[output]
101
[output]
## create the correlation matrix
cmatrix <- cor(mydata);
## remove highly correlated descriptors
highc <- findCorrelation(cmatrix, 0.90, verbose=T);
mydata <- mydata[, -highc];
ncol(mydata)
[output]
74
[output]
write.csv(mydata,file="DescriptorsFeatureSelected.csv")
Here's the resulting file (DescriptorsFeatureSelected.csv). Again using R, we developed a simple linear model and performed further feature selection to attain the fragment coefficients.
Following the instructions on the OpenBabel plugin development page, we deployed the melting point model plugin by creating the appropriate text file (20180112mpC.txt) and by updating the plugindefines.txt file in the data folder. We then used OpenBabel to calculate the melting points of the 3500 compound test set via the following command line:
The resulting predictions matched the measured values with an R2: 0.61, AAE: 102.9, and RMSE: 58.2. The performance of the model is no better than the previous model previous model
Results
We successfully created an open fragment-based model that can easily be added to OpenBabel. We used R to create the linear model and SMF as the fragment calculator for an open data set of melting points. The test-set statistics R2: 0.61, AAE: 102.9, and RMSE: 58.2.
More modeling of melting points using Open Babel
Researcher: Andrew SID LangIntroduction
The goal of this study is to create an open (CC0) fragment-based melting point model that can be deployed into Open Babel. This model will replace our previous model, which used a limited set of fragments, that has performance statistics of R2: 0.61, AAE: 43.6, and RMSE: 58.1 on the 3500 compound test-set.Procedure
To make models easily comparable, we use here the same melting point dataset (ONSMP031.csv) as used in our previous models (MPModel007, MeltingPointModel011), keeping the exact same training-set and test-set as before too.We converted the SMILES from the same training set (ONSMP031TrainingSet.csv) into an sdf file which we then ran through SMF v3.03 to calculate fragments, both atom/bond sequences (lengths: 2-5). This gave us a training-set of 16,015 compounds with 2568 sequence descriptors.
We removed strongly correlated descriptors by using the following code in R (uses the caret package)
## set the working directory setwd("C:/MeltingPointModel013") ## load the caret package library(caret) ## load in data mydata = read.csv(file="TrainingSetWithSequenceFragments.csv",head=TRUE,row.names="molID") ## get the number of columns ncol(mydata) [output] 2569 [output] ## remove small variance columns nzv <-nearZeroVar(mydata) mydata <- mydata[, -nzv] ncol(mydata) [output] 101 [output] ## create the correlation matrix cmatrix <- cor(mydata); ## remove highly correlated descriptors highc <- findCorrelation(cmatrix, 0.90, verbose=T); mydata <- mydata[, -highc]; ncol(mydata) [output] 74 [output] write.csv(mydata,file="DescriptorsFeatureSelected.csv")Here's the resulting file (DescriptorsFeatureSelected.csv). Again using R, we developed a simple linear model and performed further feature selection to attain the fragment coefficients.
## load in data mydata = read.csv(file="DescriptorsFeatureSelected.csv", head=TRUE,row.names="molID") fit <- lm(mpC ~ 0 + ., data=mydata) summary(fit) [output] Residual standard error: 63.57 on 15942 degrees of freedom Multiple R-squared: 0.7979, Adjusted R-squared: 0.797 F-statistic: 862.2 on 73 and 15942 DF, p-value: < 2.2e-16 [output] ## step-wise feature selection to remove non-significant features library(MASS) bestfit <- stepAIC(fit, upper=~., lower=~1, direction="both") ## display results bestfit$anova [output] Stepwise Model Path Analysis of Deviance Table Initial Model: mpC ~ 0 + (s3 + s4 + s7 + s10 + s11 + s12 + s13 + s16 + s17 + s18 + s20 + s21 + s22 + s23 + s24 + s25 + s26 + s27 + s28 + s29 + s30 + s31 + s34 + s36 + s37 + s38 + s39 + s40 + s41 + s43 + s44 + s46 + s47 + s48 + s49 + s50 + s79 + s80 + s81 + s84 + s91 + s92 + s95 + s101 + s108 + s111 + s113 + s115 + s118 + s120 + s121 + s124 + s126 + s128 + s129 + s130 + s131 + s133 + s146 + s156 + s164 + s168 + s177 + s184 + s185 + s221 + s222 + s223 + s228 + s229 + s280 + s285 + s289) Final Model: mpC ~ s3 + s4 + s7 + s10 + s11 + s12 + s13 + s16 + s17 + s18 + s20 + s21 + s22 + s23 + s24 + s25 + s26 + s27 + s28 + s29 + s30 + s31 + s34 + s36 + s38 + s39 + s40 + s44 + s46 + s47 + s48 + s49 + s50 + s80 + s81 + s84 + s91 + s92 + s101 + s108 + s111 + s113 + s118 + s120 + s121 + s124 + s126 + s128 + s129 + s130 + s131 + s146 + s164 + s168 + s177 + s184 + s185 + s222 + s228 + s229 + s280 + s285 + s289 - 1 Step Df Deviance Resid. Df Resid. Dev AIC 1 15942 64424489 133066.1 2 - s133 1 17.75295 15943 64424507 133064.1 3 - s41 1 20.95081 15944 64424528 133062.1 4 - s223 1 47.27437 15945 64424575 133060.1 5 - s43 1 396.60900 15946 64424972 133058.2 6 - s221 1 447.07669 15947 64425419 133056.3 7 - s156 1 454.47326 15948 64425873 133054.4 8 - s37 1 1362.72231 15949 64427236 133052.8 9 - s95 1 1297.73838 15950 64428534 133051.1 10 - s115 1 4276.68111 15951 64432810 133050.1 11 - s79 1 7519.50646 15952 64440330 133050.0 [output] summary(bestfit) ## output the final model [output] Call: lm(formula = mpC ~ s3 + s4 + s7 + s10 + s11 + s12 + s13 + s16 + s17 + s18 + s20 + s21 + s22 + s23 + s24 + s25 + s26 + s27 + s28 + s29 + s30 + s31 + s34 + s36 + s38 + s39 + s40 + s44 + s46 + s47 + s48 + s49 + s50 + s80 + s81 + s84 + s91 + s92 + s101 + s108 + s111 + s113 + s118 + s120 + s121 + s124 + s126 + s128 + s129 + s130 + s131 + s146 + s164 + s168 + s177 + s184 + s185 + s222 + s228 + s229 + s280 + s285 + s289 - 1, data = mydata) Residuals: Min 1Q Median 3Q Max -526.98 -37.40 -1.28 37.25 376.06 Coefficients: Estimate Std. Error t value Pr(>|t|) s3 -1.8498 0.2132 -8.675 < 2e-16 *** s4 -25.2846 4.1205 -6.136 8.64e-10 *** s7 6.2586 0.6926 9.036 < 2e-16 *** s10 26.7114 2.3063 11.582 < 2e-16 *** s11 2.8083 0.6881 4.081 4.50e-05 *** s12 51.9087 2.5122 20.662 < 2e-16 *** s13 8.4622 0.6209 13.628 < 2e-16 *** s16 3.7174 0.9865 3.768 0.000165 *** s17 7.8643 1.9408 4.052 5.10e-05 *** s18 5.8715 0.7096 8.274 < 2e-16 *** s20 16.7968 1.4447 11.627 < 2e-16 *** s21 0.6915 0.4553 1.519 0.128837 s22 -1.4671 0.3901 -3.761 0.000170 *** s23 -2.9295 1.2455 -2.352 0.018676 * s24 -7.0190 0.6391 -10.983 < 2e-16 *** s25 1.0471 0.5623 1.862 0.062603 . s26 -2.2683 1.3571 -1.671 0.094649 . s27 -10.0463 2.6375 -3.809 0.000140 *** s28 -2.3992 0.8358 -2.871 0.004103 ** s29 19.5947 1.4171 13.828 < 2e-16 *** s30 47.0353 2.9064 16.183 < 2e-16 *** s31 -15.4312 1.2317 -12.528 < 2e-16 *** s34 7.3887 1.0701 6.905 5.22e-12 *** s36 10.9887 1.1125 9.878 < 2e-16 *** s38 -12.8826 1.6467 -7.824 5.46e-15 *** s39 9.1593 1.4341 6.387 1.74e-10 *** s40 18.2006 1.7630 10.323 < 2e-16 *** s44 -2.8801 1.1500 -2.505 0.012272 * s46 -2.1130 1.2342 -1.712 0.086924 . s47 2.2720 1.1813 1.923 0.054461 . s48 -3.2273 1.4964 -2.157 0.031039 * s49 7.8352 2.2002 3.561 0.000370 *** s50 -11.9132 1.4625 -8.146 4.04e-16 *** s80 28.2482 2.5847 10.929 < 2e-16 *** s81 -5.3919 1.7741 -3.039 0.002375 ** s84 16.1194 2.4100 6.689 2.33e-11 *** s91 14.8185 3.1938 4.640 3.52e-06 *** s92 -2.4609 1.7168 -1.433 0.151751 s101 -9.6603 3.3528 -2.881 0.003966 ** s108 -45.5597 2.1051 -21.643 < 2e-16 *** s111 3.5281 1.7623 2.002 0.045299 * s113 -8.3954 1.5665 -5.359 8.47e-08 *** s118 -12.3305 1.5313 -8.053 8.68e-16 *** s120 -7.6604 1.1297 -6.781 1.24e-11 *** s121 4.0442 1.3917 2.906 0.003665 ** s124 -6.6457 0.7902 -8.410 < 2e-16 *** s126 2.3230 1.0538 2.205 0.027503 * s128 3.9446 1.0501 3.757 0.000173 *** s129 3.6411 0.1770 20.575 < 2e-16 *** s130 -5.4580 1.5849 -3.444 0.000575 *** s131 -25.8821 2.5100 -10.311 < 2e-16 *** s146 -2.1609 1.0674 -2.024 0.042950 * s164 -5.3542 1.1379 -4.705 2.56e-06 *** s168 16.9584 1.1291 15.019 < 2e-16 *** s177 3.4397 1.5135 2.273 0.023065 * s184 -3.0319 1.0308 -2.941 0.003274 ** s185 -12.8910 1.3574 -9.497 < 2e-16 *** s222 10.6815 2.2331 4.783 1.74e-06 *** s228 7.2849 1.3844 5.262 1.44e-07 *** s229 14.5218 2.3799 6.102 1.07e-09 *** s280 12.2807 2.0545 5.978 2.31e-09 *** s285 -8.2762 2.5626 -3.230 0.001242 ** s289 -5.4504 2.4810 -2.197 0.028045 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 63.56 on 15952 degrees of freedom Multiple R-squared: 0.7978, Adjusted R-squared: 0.797 F-statistic: 999.3 on 63 and 15952 DF, p-value: < 2.2e-16 [output]Open Babel Implementation
Following the instructions on the OpenBabel plugin development page, we deployed the melting point model plugin by creating the appropriate text file (20180112mpC.txt) and by updating the plugindefines.txt file in the data folder. We then used OpenBabel to calculate the melting points of the 3500 compound test set via the following command line:
The resulting predictions matched the measured values with an R2: 0.61, AAE: 102.9, and RMSE: 58.2. The performance of the model is no better than the previous model previous model
Results
We successfully created an open fragment-based model that can easily be added to OpenBabel. We used R to create the linear model and SMF as the fragment calculator for an open data set of melting points. The test-set statistics R2: 0.61, AAE: 102.9, and RMSE: 58.2.